Starting nv-hostengine, plus some supplementary notes
TL;DR
If you want to fix the unable to establish a connection to the specified host: localhost and Unable to connect to host engine. Host engine connection invalid/disconnected. errors that show up when running dcgmi commands such as dcgmi discovery -l, you can start nv-hostengine by invoking the nv-hostengine binary directly inside the dcgm-exporter Pod.
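The quickest version of that fix is a single kubectl exec, sketched below; the gpu-operator namespace and the Pod name are assumptions taken from a default GPU Operator install and from the prompts later in this post, so adjust both to your cluster:

# Namespace and Pod name are illustrative; nv-hostengine daemonizes itself,
# so this command returns as soon as the engine is up.
kubectl -n gpu-operator exec nvidia-dcgm-exporter-pkcrb -- nv-hostengine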
Starting an nv-hostengine instance
In the examples that follow, I will work directly with the tools and command line that ship inside the dcgm-exporter Pod deployed alongside the GPU Operator, so there is no need to install or fiddle with NVIDIA DCGM (Data Center GPU Manager) on the node itself.
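A sketch of how to locate that Pod and open a shell inside it; the gpu-operator namespace and the app=nvidia-dcgm-exporter label are assumptions based on a default GPU Operator install, so adjust them to your setup:

# List the dcgm-exporter Pods deployed by the GPU Operator.
kubectl -n gpu-operator get pods -l app=nvidia-dcgm-exporter -o wide
# Open an interactive shell in one of them.
kubectl -n gpu-operator exec -it nvidia-dcgm-exporter-pkcrb -- bash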
Once you have found the dcgm-exporter Pod and opened a shell in it, run ps -A to list all processes inside the container and check whether nv-hostengine is already running:
root@nvidia-dcgm-exporter-pkcrb:/$ ps -A
PID TTY TIME CMD
1 ? 00:00:02 dcgm-exporter
40 pts/0 00:00:00 bash
48 pts/0 00:00:00 ps
Before starting it, also check that the nv-hostengine binary can be found on the PATH:
root@nvidia-dcgm-exporter-pkcrb:/$ which nv-hostengine
/usr/bin/nv-hostengine
Then simply invoke the binary; nv-hostengine will start listening on localhost:5555 and serve incoming requests:
root@nvidia-dcgm-exporter-pkcrb:/$ nv-hostengine
Started host engine version 3.3.0 using port number: 5555
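With the engine up, the dcgmi commands that failed before should now connect, since dcgmi talks to localhost:5555 by default and that is exactly where nv-hostengine is listening:

# Should list the node's GPUs instead of the connection error from the TL;DR.
dcgmi discovery -l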
Extended notes
Simulating DCGM error metrics and data with dcgmi
The DCGM documentation has a page on error simulation and injection: Error Injection — NVIDIA DCGM Documentation.
The steps and logic go like this (a rough command sketch follows the list):
- Start the nv-hostengine daemon
- Start monitoring through DCGM
- Decide which GPU error you want to inject
- Inject the error with dcgmi test --inject
- DCGM should then report the corresponding error
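Below is a rough sketch of steps 2 through 5, run inside the Pod against the nv-hostengine started earlier. The field ID 150 (DCGM_FI_DEV_GPU_TEMP) and the injected value are purely illustrative choices on my part, not something prescribed by the DCGM docs, so pick the field that matches the error you actually want to simulate:

# In one shell inside the Pod: watch field 150 (GPU temperature) on GPU 0.
dcgmi dmon -e 150 -i 0
# In another shell: inject an implausible temperature reading for GPU 0;
# the watcher above should report the injected value on its next sample.
dcgmi test --inject --gpuid 0 -f 150 -v 120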
nv-hostengine's strange zombie-process behavior
After starting nv-hostengine, checking the running processes again with ps -A reveals a <defunct> nv-hostengine process alongside the live one:
root@nvidia-dcgm-exporter-pkcrb:/$ ps -A
PID TTY TIME CMD
1 ? 00:00:04 dcgm-exporter
61 pts/0 00:00:00 bash
71 ? 00:00:00 nv-hostengine <defunct>
72 ? 00:00:00 nv-hostengine
79 pts/0 00:00:00 ps
Although the documentation says you can shut nv-hostengine down with nv-hostengine -t when it is no longer needed, in practice the command fails and complains that it cannot terminate the host engine:
nv-hostengine -t
Host engine already running with pid 72
Unable to terminate host engine.
You can then end up in a state like the one below, where every nv-hostengine process is defunct yet starting a new one still fails because it believes the engine is already running:
root@nvidia-dcgm-exporter-4f599:/$ nv-hostengine
Host engine already running with pid 357
root@nvidia-dcgm-exporter-4f599:/$ ps -A
PID TTY TIME CMD
1 ? 00:41:59 dcgm-exporter
320 ? 00:00:00 dcgmproftester1 <defunct>
321 ? 00:00:00 dcgmproftester1 <defunct>
322 ? 00:00:00 dcgmproftester1 <defunct>
323 ? 00:00:00 dcgmproftester1 <defunct>
324 ? 00:00:00 dcgmproftester1 <defunct>
325 ? 00:00:00 dcgmproftester1 <defunct>
326 ? 00:00:00 dcgmproftester1 <defunct>
327 ? 00:00:00 dcgmproftester1 <defunct>
356 ? 00:00:00 nv-hostengine <defunct>
357 ? 00:00:00 nv-hostengine <defunct>
404 pts/0 00:00:00 bash
415 pts/0 00:00:00 ps
At this point, the only workaround I have found is to delete the dcgm-exporter Pod with kubectl and let it be recreated.
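A sketch of that step; the gpu-operator namespace is again an assumption for a default GPU Operator install, and the Pod name is whichever instance you are stuck in. Since dcgm-exporter is managed by a DaemonSet, the Pod is recreated automatically:

# Delete the stuck Pod and let the DaemonSet bring up a fresh one.
kubectl -n gpu-operator delete pod nvidia-dcgm-exporter-4f599

Once the replacement Pod is up (note the new Pod name in the prompt below), dcgmi discovery -l connects and works again: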
root@nvidia-dcgm-exporter-cq8bw:/$ dcgmi discovery -l
8 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information |
+--------+----------------------------------------------------------------------+
| 0 | Name: NVIDIA GeForce RTX 3090 |
| | PCI Bus ID: 00000000:1B:00.0 |
| | Device UUID: GPU-69cd4dac-cefb-45e8-9bff-429dfd4bc8c3 |
+--------+----------------------------------------------------------------------+
| 1 | Name: NVIDIA GeForce RTX 3090 |
| | PCI Bus ID: 00000000:1C:00.0 |
| | Device UUID: GPU-ac2d4b59-b78e-42c7-8db6-ef28b01540e1 |
+--------+----------------------------------------------------------------------+
| 2 | Name: NVIDIA GeForce RTX 3090 |
| | PCI Bus ID: 00000000:1D:00.0 |
| | Device UUID: GPU-d8f40ec7-2c07-4d2e-ae9b-a51e7377e052 |
+--------+----------------------------------------------------------------------+
| 3 | Name: NVIDIA GeForce RTX 3090 |
| | PCI Bus ID: 00000000:1E:00.0 |
| | Device UUID: GPU-5f969c85-ffc4-44ac-aa13-10824844087e |
+--------+----------------------------------------------------------------------+
0 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
+-----------+
2 CPUs found.
+--------+----------------------------------------------------------------------+
| CPU ID | Device Information |
+--------+----------------------------------------------------------------------+
| 0 | Name: Grace TH500 |
| | Cores: 0-19,40-59 |
+--------+----------------------------------------------------------------------+
| 1 | Name: Grace TH500 |
| | Cores: 20-39,60-79 |
+--------+----------------------------------------------------------------------+
Running the embedded DCGM engine with the DCGM bundled in the GPU Operator
If you are experimenting with the embedded DCGM engine, the issue Error starting embedded DCGM engine · Issue #16 · triton-inference-server/model_analyzer is worth consulting when troubleshooting.
By default, a GPU Operator installation does not enable the so-called "embedded DCGM engine", as you can see in the GPU Operator source: https://github.com/NVIDIA/gpu-operator/blob/4113883838a514cf528ae67f3cf599f79b52fc02/deployments/gpu-operator/values.yaml#L283-L294
The rendered resource kind is ClusterPolicy and the resource name is cluster-policy, so you can inspect its current configuration with the sudo kubectl get ClusterPolicy cluster-policy -o yaml command:
dcgm:
  enabled: false
  hostPort: 5555
  image: dcgm
  imagePullPolicy: IfNotPresent
  repository: nvcr.io/nvidia/cloud-native
  version: 3.3.0-1-ubuntu22.04
As you can see, enabled ends up as false, so the dcgm module is indeed not turned on here.
All you need to do is edit the ClusterPolicy resource with sudo kubectl edit ClusterPolicy cluster-policy, set dcgm.enabled to true, and wait for the Pods to finish starting before carrying on.
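If you prefer a non-interactive change, a merge patch should be equivalent, assuming the field lives at spec.dcgm.enabled as suggested by the YAML above:

# Non-interactive alternative to kubectl edit; spec.dcgm.enabled is an assumed path.
sudo kubectl patch ClusterPolicy cluster-policy --type merge -p '{"spec":{"dcgm":{"enabled":true}}}'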
Further reading
References
- Kubernetes概览 — Cloud Atlas beta 文档 and cloud-atlas/source/kubernetes/gpu/dcgm-exporter.rst at a168158169f5aad0f7683d698cc740f1846374fe · huataihuang/cloud-atlas (github.com) reference nv-hostengine in their code examples
- Getting Started — gpu-operator 23.6.0 documentation
- Azure's Moneo references and invokes nv-hostengine: Moneo/src/worker/shutdown.sh at 70c0a2d75355d82909784c886ed0fc169a49a033 · Azure/Moneo (github.com)
- DataDog's GPU integration also uses nv-hostengine: integrations-core/dcgm/README.md at 764d52840cf6cf694d0eaec5929fc7d799b7fc29 · DataDog/integrations-core (github.com)
- The llm-action project references it as well: github.com/liguodongiot/llm-action/blob/c9d8a3570b9e156ef6af035cf1b21294cb3eedbf/docs/llm-base/monitor.md?plain=1#L13-L16
- NVIDIA's official DCGM Dockerfile uses nv-hostengine directly: DCGM/docker/Dockerfile.ubi8 at a33560c9c138c617f3ee6cb50df11561302e5743 · NVIDIA/DCGM (github.com)
- Monitor Your Computing System with Prometheus, Grafana, Alertmanager, and Nvidia DCGM
- Error: unable to establish a connection to the specified host: localhost · Issue #43 · NVIDIA/DCGM